This post walks through the LLMOps Notebook from the edX course Large Language Models: Application through Production, covering the notebook's contents and how to verify the results in the Databricks UI.
After the environment is prepared with the Classroom Setup, the notebook uses the Extreme Summarization (XSum) dataset and Hugging Face's T5 (Text-to-Text Transfer Transformer) as the example dataset and language model.
Fetch the dataset with load_dataset.
from datasets import load_dataset
from transformers import pipeline
xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir=DA.paths.datasets
)  # Note: We specify cache_dir to use pre-cached data.
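The logging code further down references a small sample of articles, xsum_sample, which the notebook selects from the loaded dataset. A minimal sketch, assuming a 10-document slice of the test split:
# Assumption: a small sample of documents to summarize and log later.
xsum_sample = xsum_dataset["test"].select(range(10))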
Define prod_data_path and test_spark_dataset to persist the test split as a Delta table.
prod_data_path = f"{DA.paths.working_dir}/m6_prod_data"
test_spark_dataset = spark.createDataFrame(xsum_dataset["test"].to_pandas())
test_spark_dataset.write.format("delta").mode("overwrite").save(prod_data_path)
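As a quick check (not part of the course notebook), the Delta table can be read back with Spark to confirm the write:
# Optional sanity check: read the Delta table back and count rows.
prod_df = spark.read.format("delta").load(prod_data_path)
print(prod_df.count())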
Build the summarizer pipeline.
from transformers import pipeline
# Later, we plan to log all of these parameters to MLflow.
# Storing them as variables here will help with that.
hf_model_name = "t5-small"
min_length = 20
max_length = 40
truncation = True
do_sample = True
summarizer = pipeline(
    task="summarization",
    model=hf_model_name,
    min_length=min_length,
    max_length=max_length,
    truncation=truncation,
    do_sample=do_sample,
    model_kwargs={"cache_dir": DA.paths.datasets},
)  # Note: We specify cache_dir to use pre-cached models.
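The MLflow run below also logs the generated summaries via a results variable, which the notebook produces by running the pipeline over the sample. A minimal sketch, assuming the xsum_sample defined above:
# Assumption: run the summarizer over the sample documents.
# Each result is a dict containing a "summary_text" key, which the logging code below unpacks.
results = summarizer(xsum_sample["document"])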
As covered earlier, experiment information can be recorded on an MLflow tracking server. Here mlflow.set_experiment sets the experiment path, mlflow.start_run starts a run, and inside the run mlflow.llm.log_predictions logs the pipeline's inputs and outputs.
import mlflow
# Tell MLflow Tracking to use this explicit experiment path,
# which is in your home directory under the Workspace browser (left-hand sidebar).
mlflow.set_experiment(f"/Users/{DA.username}/LLM 06 - MLflow experiment")
with mlflow.start_run():
    # LOG PARAMS
    mlflow.log_params(
        {
            "hf_model_name": hf_model_name,
            "min_length": min_length,
            "max_length": max_length,
            "truncation": truncation,
            "do_sample": do_sample,
        }
    )

    # --------------------------------
    # LOG INPUTS (QUERIES) AND OUTPUTS
    # Logged `inputs` are expected to be a list of str, or a list of str->str dicts.
    results_list = [r["summary_text"] for r in results]

    # Our LLM pipeline does not have prompts separate from inputs, so we do not log any prompts.
    mlflow.llm.log_predictions(
        inputs=xsum_sample["document"],
        outputs=results_list,
        prompts=["" for _ in results_list],
    )

    # ---------
    # LOG MODEL
    # We next log our LLM pipeline as an MLflow model.
    # This packages the model with useful metadata, such as the library versions used to create it.
    # This metadata makes it much easier to deploy the model downstream.
    # Under the hood, the model format is simply the ML library's native format (Hugging Face for us), plus metadata.

    # It is valuable to log a "signature" with the model telling MLflow the input and output schema for the model.
    signature = mlflow.models.infer_signature(
        xsum_sample["document"][0],
        mlflow.transformers.generate_signature_output(
            summarizer, xsum_sample["document"][0]
        ),
    )
    print(f"Signature:\n{signature}\n")

    # For mlflow.transformers, if there are inference-time configurations,
    # those need to be saved specially in the log_model call (below).
    # This ensures that the pipeline will use these same configurations when re-loaded.
    inference_config = {
        "min_length": min_length,
        "max_length": max_length,
        "truncation": truncation,
        "do_sample": do_sample,
    }

    # Logging a model returns a handle `model_info` to the model metadata in the tracking server.
    # This `model_info` will be useful later in the notebook to retrieve the logged model.
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="summarizer",
        task="summarization",
        inference_config=inference_config,
        signature=signature,
        input_example="This is an example of a long news article which this pipeline can summarize for you.",
    )
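After the run completes, model_info.model_uri can be used to load the model back from the tracking server and try a prediction. A minimal sketch (the exact retrieval steps in the notebook may differ):
# Reload the logged pipeline as a generic pyfunc model and summarize one document.
loaded_summarizer = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_summarizer.predict(xsum_sample["document"][0]))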
Reference: